A Data Analytic Framework for Unstructured Text
Authors: Hassanin M. Al-Barhamtoshy, Fathy E. Eassa
Abstract
This paper describes the systematic flow of unstructured data in industry: how data are collected, how they are stored, and how much data there is. Big data systems use a scalable storage index and a distributed approach to retrieve the required information. The paper therefore introduces a framework for managing and discovering unstructured data based on the 3Vs of big data: variety, velocity, and volume. Different approaches to managing, collecting, and classifying Twitter data, e-mail data, and free text are required to manage resources more efficiently and to build a software platform around scalable analytics. The development process in this paper is implemented in Python; it builds a lexicon and calculates a sentiment score. Analyzing Twitter data and e-mail data answered many questions, such as: what are people talking about, and what is most important? The accuracy of the proposed classifier was 77.78 without stop words, and 78.76 and 79.94 with stop words (25 and 174, respectively). If the stop words are increased, the accuracy will be 87.69. The accuracy differed by 10% between the Naïve Bayes and Maximum Entropy classifiers. [Hassanin M. Al-Barhamtoshy and Fathy E. Eassa. A Data Analytic Framework for Unstructured Text. Life Sci. J 2014; 11(10):339-350] (ISSN: 1097-8135). http://www.lifesciencesite.com. 48
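The abstract mentions building a lexicon and calculating a sentiment score in Python, with and without stop words. A minimal sketch of that general idea might look like the following. The lexicon entries, the stop-word list, and the function names (`tokenize`, `sentiment_score`, `label`) are all hypothetical and chosen only for illustration; they are not taken from the paper, whose actual lexicon and classifier are not shown here.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (illustration only).
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

# Hypothetical stop-word list; the paper reports lists of 25 and 174 words,
# but their contents are not given in the abstract.
STOP_WORDS = {"the", "a", "is", "this", "i", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=LEXICON, stop_words=STOP_WORDS):
    """Sum the lexicon weights of all tokens that are not stop words."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return sum(lexicon.get(t, 0) for t in tokens)


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for message in ["I love this great phone", "This is a terrible, bad day"]:
        score = sentiment_score(message)
        print(message, "->", score, label(score))
```

Passing `stop_words=set()` disables filtering, which gives a simple way to compare scores with and without a stop-word list.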
Similar resources
Arabic OCR Segmented - based System Hassanin
A new investigation of the Arabic OCR system is presented for the offline recognition of machine-printed cursive words. Therefore, a reliable transformation mechanism will be used to transform image text into free text (ASCII or Unicode Texts), that can be directly searched by a computer. Therefore, traditional preprocessing model (segmentation phase) will be included to extract each word from ...
Assessing the Quality of Unstructured Data: An Initial Overview
In contrast to structured data, unstructured data such as texts, speech, videos and pictures do not come with a data model that enables a computer to use them directly. Nowadays, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition. Therefore, unstructured data are used increasingly in decision-making proc...
Nonparametric Regression Estimation under Kernel Polynomial Model for Unstructured Data
The nonparametric estimation (NE) of kernel polynomial regression (KPR) model is a powerful tool to visually depict the effect of covariates on response variable, when there exist unstructured and heterogeneous data. In this paper we introduce KPR model that is the mixture of nonparametric regression models with bootstrap algorithm, which is considered in a heterogeneous and unstructured framewo...
Turning Quantitative: An Analytic Scale to Do Critical Discourse Analysis
Critical Discourse Analysis (CDA) could be seen as a theory in qualitative more than in quantitative studies. This might have led to difficulty in doing CDA. Accordingly, this study attempted to develop a quantitative profile in the form of an analytic rubric. For this purpose, Fairclough’s model of CDA was selected as the research framework. The techniques used for structuring analy...
A generalized Framework of Privacy Preservation in Distributed Data mining for Unstructured Data Environment
The management of unstructured data is recognized as one of the major unsolved problems in the information industry and data mining paradigm. Unstructured data is computerized information that either does not have a data model or is not easily usable by data mining. This paper proposes a solution to this problem by managing unstructured data into structured data using legacy system and...